Abstract

We revisit the additive model learning literature and adapt a penalized spline formulation
due to Eilers and Marx to train additive classifiers efficiently. We also propose two new
embeddings based on two classes of orthogonal bases with orthogonal derivatives, which can
also be used to learn additive classifiers efficiently. This paper follows a popular theme in
the current literature, in which kernel SVMs are learned much more efficiently using an
approximate embedding and a linear machine. We show that spline bases are especially well
suited for learning additive models because of their sparsity structure and the ease of
computing the embedding, which enables one to train these models in an online manner without
incurring the memory overhead of precomputing and storing the embeddings. We show interesting
connections between the B-Spline basis and the histogram intersection kernel, and show that
for a particular choice of regularization and degree of the B-Splines, our proposed learning
algorithm closely approximates the histogram intersection kernel SVM. This enables one to
learn additive models with almost no memory overhead compared to a fast linear solver such
as LIBLINEAR, while being only 5-6x slower on average. On two large scale image
classification datasets, MNIST and Daimler Chrysler pedestrians, the proposed additive
classifiers are as accurate as the kernel SVM, while being two orders of magnitude faster
to train.

1 Introduction 

Non-parametric models for classification have become attractive since the introduction of
kernel methods like the Support Vector Machines (SVMs) [1]. The complexity of the learned
models scales with the data, which gives them desirable asymptotic properties. However, from
an estimation point of view, parametric models can offer significant statistical and
computational advantages. Recent years have seen a shift of focus from non-parametric models
to semi-parametric models for learning classifiers. This includes the work of Rahimi and
Recht [15], who compute an approximate feature map $\phi$ for shift invariant kernels,
$K(|x - y|) \approx \phi(x)'\phi(y)$, and solve the kernel problem approximately using a
linear problem. This line of work has become extremely attractive with the advent of several
algorithms for training linear classifiers efficiently (e.g., LIBLINEAR [6], PEGASOS [16]),
including online variants which have very low memory overhead.

Additive models, i.e., functions that decompose over dimensions ($f(x) = \sum_i f_i(x_i)$),
are a natural extension of linear models and arise in many settings. In particular, if the
kernel is additive, i.e., $K(x, y) = \sum_i K_i(x_i, y_i)$, then the learned SVM classifier is
also additive. A large number of useful kernels used in computer vision are based on
comparing distributions of low level features on images and are additive, e.g., the histogram
intersection and $\chi^2$ kernels [11]. This one dimensional decomposition allows one to
compute approximate feature maps independently for each dimension, leading to very compact
feature maps and making estimation efficient. This line of work has been explored by Maji and
Berg [11], who construct approximate feature maps for the min kernel and learn piecewise
linear functions in each dimension. For $\gamma$-homogeneous additive kernels, Vedaldi and
Zisserman [17] propose to use the closed form features of Hein and Bousquet [8] to construct
approximate feature maps.

Smoothing splines [18] are another way of estimating additive models, and are well known in
the statistical community. Ever since Generalized Additive Models (GAMs) were introduced by
Hastie and Tibshirani [7], many practical approaches to training such models for regression
have emerged, for example the P-Spline formulation of Eilers and Marx [4]. However, these
algorithms do not scale to the extremely large datasets and high-dimensional features typical
of image or text classification.

In this work we show that the spline framework can also be used to derive embeddings for
training additive classifiers efficiently. We propose two families of embeddings which have
the property that the underlying additive classifier can be learned directly by estimating a
linear classifier in the embedded space. The first family of embeddings is based on the
Penalized Spline ("P-Spline") formulation of additive models (Eilers and Marx [4]), where the
function in each dimension is represented using a uniformly spaced spline basis and the
regularization penalizes the difference between adjacent spline coefficients. The second
family of embeddings is based on a generalized Fourier expansion of the function in each
dimension.

This work ties together the literature on additive model regression and linear SVMs to
develop algorithms for training additive models in the classification setting. We discuss how
our additive embeddings are related to additive kernels in Section 4. In particular, our
representations include those of [11] as a special case arising from a particular choice of
B-Spline basis and regularization. An advantage of our representations is that they allow
explicit control of the smoothness of the functions and the choice of basis functions, which
may be desirable in certain situations. Moreover, the sparsity of some of our representations
leads to efficient training algorithms for smooth fits of functions. We summarize the previous
work in the next section.

2 Previous Work 

The history of learning additive models goes back to [7], who proposed the "backfitting
algorithm" to estimate additive models. Since then many practical approaches have emerged,
the most prominent of which is the Penalized Spline formulation ("P-Spline") proposed by [4],
which consists of modeling the one dimensional functions using a large number of uniformly
spaced B-Spline basis functions. Smoothness is ensured by penalizing the differences between
adjacent spline coefficients. We describe the formulation in detail in Section 3. A key
advantage of this formulation is that the whole problem can be solved using a linear system.

Given data $(x^k, y^k)$, $k = 1, \ldots, m$ with $x^k \in \mathbb{R}^D$ and
$y^k \in \{-1, +1\}$, discriminative training of functions often involves an optimization
such as:

$$\min_f \sum_k l\left(y^k, f(x^k)\right) + \lambda R(f) \tag{1}$$

where $l$ is a loss function and $R(f)$ is a regularization term. In the classification
setting a commonly used loss function $l$ is the hinge loss:

$$l\left(y^k, f(x^k)\right) = \max\left(0, 1 - y^k f(x^k)\right) \tag{2}$$
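As a small illustration of Equations (1) and (2), the following sketch evaluates the hinge loss and the regularized objective for given classifier outputs. The function names, the toy scores and the precomputed regularizer value `reg_value` are assumptions made for this example only; they are not part of the paper's implementation.

```python
import numpy as np

def hinge_loss(y, scores):
    """Per-example hinge loss max(0, 1 - y * f(x)) from Equation (2)."""
    return np.maximum(0.0, 1.0 - y * scores)

def objective(y, scores, reg_value, lam):
    """Equation (1): summed loss plus lam times a precomputed regularizer R(f)."""
    return hinge_loss(y, scores).sum() + lam * reg_value

y = np.array([1.0, -1.0, 1.0])        # labels in {-1, +1}
scores = np.array([2.0, 0.5, 0.25])   # hypothetical classifier outputs f(x^k)
losses = hinge_loss(y, scores)
total = objective(y, scores, reg_value=0.5, lam=0.1)
```

Only examples that fall inside the margin or are misclassified contribute to the loss, which is why solvers for this objective update the model only on such examples.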

For various kernel SVMs the regularization penalizes the norm of the function in the implicit
Reproducing Kernel Hilbert Space (RKHS) of the kernel. Approximating the RKHS of these
additive kernels provides a way of training additive kernel SVM classifiers efficiently. For
shift invariant kernels, Rahimi and Recht [15] derive features based on Bochner's theorem.
Vedaldi and Zisserman [17] propose to use the closed form features of Hein and Bousquet [8]
to train additive kernel SVMs efficiently for many commonly used additive kernels which are
$\gamma$-homogeneous. For the min kernel, Maji and Berg [11] propose an approximation and an
efficient learning algorithm, and our work is closely related to this.

In the additive modeling setting, a typical regularization is to penalize the norm of the dth
order derivatives of the function in each dimension, i.e.,
$R(f) = \sum_i \int f_i^{(d)}(t)^2 \, dt$. Our features are based on encodings that enable
efficient evaluation and computation of this regularization. For further discussion we assume
that the features $x^k$ are one dimensional. Once the embeddings are derived for the one
dimensional case, the overall embedding is the concatenation of the embeddings in each
dimension, as the classifiers are additive.



3 Spline Embeddings 



Eilers and Marx [4] proposed a practical modeling approach for GAMs. The idea is based on
representing the functions in each dimension using a relatively large number of uniformly
spaced B-Spline basis functions. The smoothness of these functions is ensured by penalizing
the first or second order differences between the adjacent spline coefficients. Let
$\phi(x^k)$ denote the vector with entries $\phi_i(x^k)$, the projection of $x^k$ onto the ith
basis function. The P-Spline optimization problem for the classification setting with the
hinge loss function consists of minimizing $c(w)$:

$$c(w) = \frac{\lambda}{2} w' D_d' D_d w + \frac{1}{m} \sum_k \max\left(0, 1 - y^k \left(w'\phi(x^k)\right)\right) \tag{3}$$

The matrix $D_d$ constructs the dth order differences of $\alpha$:

$$D_d \alpha = \Delta^d \alpha \tag{4}$$

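The first difference of $\alpha$, $\Delta^1 \alpha$, is a vector of elements
$\alpha_i - \alpha_{i+1}$. Higher order difference matrices can be computed by repeating the
differencing. For an n dimensional basis, the difference matrix $D_1$ is an $(n-1) \times n$
matrix with $d_{i,i} = 1$, $d_{i,i+1} = -1$ and zero everywhere else. The matrices $D_1$ and
$D_2$ are as follows:

$$D_1 = \begin{pmatrix} 1 & -1 & & \\ & \ddots & \ddots & \\ & & 1 & -1 \end{pmatrix}, \quad
D_2 = \begin{pmatrix} 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \end{pmatrix}$$

To enable a reduction to the linear case we propose a slightly different difference matrix
$D_1$. We let $D_1$ be an $n \times n$ matrix with $d_{i,i} = 1$, $d_{i,i-1} = -1$. This is the
same as the first order difference matrix proposed by Eilers and Marx, with one more row added
on top. The resulting difference matrices $D_1$ and $D_2 = D_1^2$ are both $n \times n$
matrices:

$$D_1 = \begin{pmatrix} 1 & & & \\ -1 & 1 & & \\ & \ddots & \ddots & \\ & & -1 & 1 \end{pmatrix}, \quad
D_2 = \begin{pmatrix} 1 & & & & \\ -2 & 1 & & & \\ 1 & -2 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & -2 & 1 \end{pmatrix}$$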



The first row in $D_1$ has the effect of penalizing the norm of the first coefficient of the
spline basis, which plays the role of regularization in the linear setting (e.g., ridge
regression, linear SVMs, etc.). Alternatively, one can think of this as an additional basis
function at the leftmost point with its coefficient set to zero. The key advantage is that the
matrix $D_1$ is invertible and has a particularly simple form, which allows us to linearize
the whole system. We will also show in Section 4 that the derived embeddings approximate the
learning problem of a kernel SVM classifier using the min kernel ($K_{\min}$) for a particular
choice of spline basis.

$$K_{\min}(x, y) = \sum_i \min(x_i, y_i) \tag{5}$$

Given the choice of the regularization matrix $D_d$, which is invertible, one can linearize
the whole system by re-parametrizing $w$ by $D_d^{-1} w$, which results in:

$$c(w) = \frac{\lambda}{2} w'w + \frac{1}{m} \sum_k \max\left(0, 1 - y^k \left(w' (D_d')^{-1} \phi(x^k)\right)\right) \tag{6}$$

Therefore the whole classifier is linear in the features
$\phi^d(x^k) = (D_d')^{-1} \phi(x^k)$, i.e., the optimization problem is equivalent to

$$c(w) = \frac{\lambda}{2} w'w + \frac{1}{m} \sum_k \max\left(0, 1 - y^k \left(w'\phi^d(x^k)\right)\right) \tag{7}$$

Figure 1: Local basis functions for linear (left), quadratic (middle) and cubic (right)
B-Splines for various regularization degrees d. In each figure $\phi^d$ refers to the dense
features $(D_d')^{-1}\phi$. When d = 0, the functions shown are the local B-Spline basis
functions. When d = r + 1, where r is the degree of the B-Spline basis, $\phi^d$ are the
truncated polynomial basis functions $(x - \tau_i)_+^r$ (see Section 4).



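The inverse matrices $(D_1')^{-1}$ and $(D_2')^{-1}$ are both upper triangular matrices. The
matrix $(D_1')^{-1}$ has entries $d_{i,j} = 1$, $j \geq i$, and $(D_2')^{-1}$ has entries
$d_{i,j} = j - i + 1$, $j \geq i$. They look like:

$$(D_1')^{-1} = \begin{pmatrix} 1 & 1 & \cdots & 1 & 1 \\ & 1 & \cdots & 1 & 1 \\ & & \ddots & \vdots & \vdots \\ & & & 1 & 1 \\ & & & & 1 \end{pmatrix}, \quad
(D_2')^{-1} = \begin{pmatrix} 1 & 2 & \cdots & n-1 & n \\ & 1 & \cdots & n-2 & n-1 \\ & & \ddots & \vdots & \vdots \\ & & & 1 & 2 \\ & & & & 1 \end{pmatrix}$$

We refer the readers to [5] for an excellent review of additive modeling using splines.
Figure 1 shows $\phi^d$ for various choices of the regularization degree d = 0, 1, 2 and
B-Spline bases (linear, quadratic and cubic).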
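The construction above can be checked numerically. The sketch below builds the proposed $n \times n$ difference matrices, confirms the triangular structure of the transposed inverses, and computes the dense features $\phi^d = (D_d')^{-1}\phi$ for a linear B-Spline encoding. The helper names are hypothetical, and the linear spline encoding on N + 1 uniform knots in [0, 1) is borrowed from Section 4 as an assumption; a practical solver would not form these matrices explicitly.

```python
import numpy as np

def difference_matrix(n, d):
    """n x n matrix D_d = D_1^d, where D_1 has 1 on the diagonal and -1 below it."""
    D1 = np.eye(n) - np.eye(n, k=-1)
    return np.linalg.matrix_power(D1, d)

def linear_bspline_features(x, N):
    """Sparse linear B-spline features for x in [0, 1) on N + 1 uniform knots."""
    r = int(np.floor(N * x))
    a = N * x - r
    phi = np.zeros(N + 1)
    phi[r], phi[r + 1] = 1.0 - a, a
    return phi

def dense_features(phi, d):
    """phi^d = (D_d')^{-1} phi, computed by solving D_d' z = phi."""
    D = difference_matrix(len(phi), d)
    return np.linalg.solve(D.T, phi)

n = 5
D1_inv_T = np.linalg.inv(difference_matrix(n, 1)).T   # upper triangular, all ones
D2_inv_T = np.linalg.inv(difference_matrix(n, 2)).T   # entries j - i + 1 for j >= i
phi = linear_bspline_features(0.55, N=4)              # two non-zero entries
phi1 = dense_features(phi, d=1)                       # unary-like dense encoding
```

With d = 0 the features stay sparse (two non-zeros for a linear spline), whereas d = 1 already fills every entry to the left of the active bin, which is the trade-off between sparsity and smoothness discussed in the text.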



3.1 Generalized Fourier Embeddings 



Generalized Fourier expansion provides an alternate way of fitting additive models. Let
$\psi_1(x), \psi_2(x), \ldots, \psi_n(x)$ be an orthogonal basis system in the interval
$[a, b]$ with respect to a weight function $w(x)$, i.e., we have
$\int_a^b \psi_i(x)\psi_j(x)w(x)\,dx = 0$, $i \neq j$. Given a function
$f(x) = \sum_i a_i \psi_i(x)$, the regularization can be written as:

$$\int_a^b f^{(d)}(x)^2 w(x)\,dx = \int_a^b \Big(\sum_i a_i \psi_i^{(d)}(x)\Big)^2 w(x)\,dx = \sum_{i,j} a_i a_j \int_a^b \psi_i^{(d)}(x)\psi_j^{(d)}(x) w(x)\,dx \tag{8}$$

Consider an orthogonal family of basis functions which are differentiable and whose
derivatives are also orthogonal. One can normalize the basis such that
$\int_a^b \psi_i^{(d)}(x)\psi_j^{(d)}(x)w(x)\,dx = \delta_{ij}$. In this case the
regularization has a simple form:

$$\int_a^b f^{(d)}(x)^2 w(x)\,dx = \sum_{i,j} a_i a_j \int_a^b \psi_i^{(d)}(x)\psi_j^{(d)}(x) w(x)\,dx = \sum_i a_i^2 \tag{9}$$

Thus the overall regularized additive classifier can be learned by learning a linear
classifier in the embedded space $\psi(x)$. In practice one can approximate the scheme by
using a small number of basis functions. We propose two practical ones with closed form
embeddings:

Fourier basis. The classic Fourier basis functions
$\{1, \cos(\pi x), \sin(\pi x), \cos(2\pi x), \sin(2\pi x), \ldots\}$ are orthogonal in
$[-1, 1]$ with respect to the weight function $w(x) = 1$. The derivatives are also in the same
family (except for the constant basis function), hence are also orthogonal. The normalized
feature embeddings for d = 1, 2 are shown in Table 1.
Basis     Domain and weight                       Encoding
Fourier   $x \in [-1, 1]$, $w(x) = 1$             $\phi_n^d(x) = \frac{1}{(n\pi)^d}\left[\cos(n\pi x), \sin(n\pi x)\right]$
Hermite   $x \sim N(0, 1)$, $w(x) = e^{-x^2/2}$   $\phi_n^1(x) = \frac{H_n(x)}{\sqrt{n \, n!}}$, $\phi_n^2(x) = \frac{H_n(x)}{\sqrt{n(n-1) \, n!}}$, $n > 1$

Table 1: Fourier and Hermite encodings $\phi$ for spline regression penalizing the dth derivative.
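As a sketch of the Fourier embedding, the code below scales each cosine and sine feature by $1/(k\pi)^d$, which is the scaling implied by the normalization in Equation (9) because differentiating $\cos(k\pi x)$ d times multiplies its norm by $(k\pi)^d$. The function name and the numerical check are assumptions for illustration, and inputs are assumed to be already scaled to [-1, 1].

```python
import numpy as np

def fourier_features(x, n_freq, d=1):
    """Features [cos(k*pi*x), sin(k*pi*x)] / (k*pi)^d for k = 1..n_freq, x in [-1, 1]."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    k = np.arange(1, n_freq + 1)
    scale = (k * np.pi) ** d
    arg = np.pi * np.outer(x, k)
    return np.hstack([np.cos(arg) / scale, np.sin(arg) / scale])

# Numerically check that the first derivatives are orthonormal on [-1, 1].
# The derivative of cos(k*pi*x) / (k*pi) is -sin(k*pi*x).
grid = np.linspace(-1.0, 1.0, 20001)
dx = grid[1] - grid[0]
d_cos1 = -np.sin(np.pi * grid)
d_cos2 = -np.sin(2 * np.pi * grid)
norm_11 = np.sum((d_cos1 * d_cos1)[:-1]) * dx    # Riemann sum, close to 1
inner_12 = np.sum((d_cos1 * d_cos2)[:-1]) * dx   # Riemann sum, close to 0
feats = fourier_features([0.0, 0.5], n_freq=2, d=1)
```

Each input dimension yields `2 * n_freq` dense features, so the embedding is low dimensional but requires trigonometric evaluations for every input value.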



Hermite basis. Hermite polynomials also form an orthogonal basis system with orthogonal
derivatives with respect to the weight function $e^{-x^2/2}$. Using the following identity:

$$\int_{-\infty}^{\infty} H_m(x) H_n(x) e^{-x^2/2}\,dx = \sqrt{2\pi}\, n!\, \delta_{mn} \tag{10}$$

and the property that $H_n' = n H_{n-1}$ (Appell sequence), one can obtain closed form
features for d = 1, 2 as shown in Table 1. It is also known that the families of polynomial
basis functions which are orthogonal and whose derivatives are orthogonal belong to one of
three families: Jacobi, Laguerre or Hermite [19]. The extended support of the weight function
of the Hermite basis makes it well suited for additive modeling.

Although both of these bases are complete, for practical purposes one has to use only the
first few basis functions. The quality of the approximation depends on how well the
underlying function can be approximated by the chosen basis functions; for example, low
degree polynomials are better represented by the Hermite basis.
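A minimal sketch of the Hermite embedding follows. It assumes the probabilists' Hermite polynomials (the family that is orthogonal under $e^{-x^2/2}$ and satisfies $H_n' = n H_{n-1}$), evaluated with NumPy's `hermeval`, and it drops the common constant $\sqrt{2\pi}$ from Equation (10) since a global scale can be absorbed into the regularization parameter. The function name is hypothetical.

```python
import numpy as np
from math import factorial
from numpy.polynomial.hermite_e import hermeval

def hermite_features(x, degree, d=1):
    """Features He_n(x) / sqrt(norm_n), with norm_n = n*n! (d=1) or n*(n-1)*n! (d=2)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    start = d                      # the d-th derivative of He_n vanishes for n < d
    cols = []
    for n in range(start, degree + 1):
        coeffs = np.zeros(n + 1)
        coeffs[n] = 1.0            # select the single polynomial He_n
        norm = n * factorial(n) if d == 1 else n * (n - 1) * factorial(n)
        cols.append(hermeval(x, coeffs) / np.sqrt(norm))
    return np.column_stack(cols)

feats = hermite_features([0.0, 1.0, 2.0], degree=3, d=1)
```

Because the polynomials grow quickly away from the origin, inputs are typically standardized before this encoding is applied, in line with the weight function being a Gaussian.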



4 Additive Kernel Reproducing Kernel Hilbert Space & Spline Embeddings 

We begin by showing the close resemblance of the spline embeddings to the min kernel. To see
this, let the features in [0, 1) be represented with N + 1 uniformly spaced linear spline
basis functions centered at $0, \frac{1}{N}, \frac{2}{N}, \ldots, 1$. Let
$r = \lfloor Nx \rfloor$ and let $a = Nx - r$. Then the features $\phi(x)$ are given by
$\phi_r(x) = 1 - a$, $\phi_{r+1}(x) = a$, and the features $\phi^1(x)$ for the $D_1$ matrix
are given by $\phi_i^1(x) = 1$ if $i \leq r$, and $\phi_{r+1}^1(x) = a$. It can be seen that
these features closely approximate the min kernel, i.e.,

$$\frac{1}{N} \phi^1(x)'\phi^1(y) \approx \min(x, y) + \frac{1}{N} \tag{11}$$

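The features $\phi^1(x) = (D_1')^{-1}\phi(x)$ construct a unary-like representation where the
number of ones equals the position of the bin of x. One can verify that for a B-spline basis
of degree r (r = 1, 2, 3), the following holds:

$$\frac{1}{N} \phi^1(x)'\phi^1(y) = \min(x, y) + \frac{r+1}{2N}, \quad \text{if } |x - y| \geq \frac{r}{N} \tag{12}$$

Define $K_d^r$, the kernel corresponding to a B-Spline basis of degree r and regularization
matrix $D_d$, as follows:

$$K_d^r(x, y) = \frac{1}{N} \phi^d(x)'\phi^d(y) - \frac{r+1}{2N} = \frac{1}{N} \phi(x)' D_d^{-1} (D_d')^{-1} \phi(y) - \frac{r+1}{2N} \tag{13}$$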

Figure 2 shows $K_1^r$ for r = 1, 2, 3, corresponding to linear, quadratic and cubic B-Spline
bases. In a recent paper, Maji and Berg [11] propose to use a linear spline basis and $D_1$
regularization to train approximate intersection kernel SVMs, which in turn approximate
arbitrary additive classifiers. Our features can be seen as a generalization of this work
which allows arbitrary spline bases and regularizations.
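The relation between the linear spline embedding with $D_1$ regularization and the min kernel can be verified on a small example. In the sketch below, `unary_features` is a hypothetical helper that forms the linear spline features on N + 1 uniform knots (zero-based indices, as in the text) and accumulates them from the right, which is what multiplying by the upper triangular matrix of ones does.

```python
import numpy as np

def unary_features(x, N):
    """phi^1(x) for linear splines: reverse cumulative sum of the sparse features."""
    r = int(np.floor(N * x))
    a = N * x - r
    phi = np.zeros(N + 1)
    phi[r], phi[r + 1] = 1.0 - a, a
    return np.cumsum(phi[::-1])[::-1]

N = 10
x, y = 0.23, 0.71                     # points in different bins, |x - y| >= 1/N
approx = unary_features(x, N) @ unary_features(y, N) / N
exact = min(x, y) + 1.0 / N
```

For this pair the two quantities agree up to floating point error; when x and y fall in the same bin the inner product only approximates the min kernel.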

B-Splines are closely related to the truncated polynomial kernel [18][14], which consists of
uniformly spaced knots $\tau_1, \ldots, \tau_n$ and truncated polynomial features:

$$\phi_i(x) = (x - \tau_i)_+^k \tag{14}$$

[Figure 2 panels: $K_{\min}$, the spline kernels $K_1^r$, and the difference images $K_{\min} - K_1^r$.]

Figure 2: Spline Kernels. $K_{\min}(x, y)$, $x, y \in [0, 1]$, along with $K_1^r$ for
r = 1, 2, 3 corresponding to linear, quadratic and cubic B-Spline bases. Using uniformly
spaced basis functions separated by 0.1, these kernels closely approximate the min kernel, as
seen in the difference images. The approximation is exact when $|x - y| \geq 0.1r$.

However, these features are not as numerically stable as the B-Spline basis (see [5] for an
experimental comparison). Truncated polynomials of degree k correspond to a B-Spline basis of
degree k and $D_{k+1}$ regularization, i.e., the same as the $K_{k+1}^k$ kernel, when the
knots are uniformly spaced. This is because B-Splines are derived from the truncated
polynomial basis by repeated application of the difference matrix [3, 5]. As noted by the
authors in [5], one of the advantages of the P-Spline formulation is that it decouples the
order of regularization from the degree of the B-Spline basis. In our experiments, $D_1$
regularization typically provides sufficient smoothness.

5 Optimizations for Efficient Learning for B-Spline embeddings 

For B-spline bases one can exploit the sparsity to speed up linear solvers. The
classification function is based on evaluating $w'(D_d')^{-1}\phi(x)$. Most methods for
training linear classifiers are based on evaluating the classifier and updating it if the
classification is incorrect. Since the number of such evaluations is larger than the number
of updates, it is much more efficient to maintain $w_d = D_d^{-1} w$ and use sparse vector
multiplication. Updates to the weight vectors $w$ and $w_d$ for various gradient descent
algorithms look like:

$$w^t \leftarrow w^{t-1} - \eta (D_d')^{-1}\phi(x^k), \quad w_d^t \leftarrow w_d^{t-1} - \eta L_d \phi(x^k) \tag{15}$$

where $\eta$ is a step size and $L_d = D_d^{-1}(D_d')^{-1}$. Unlike the matrix $D_d' D_d$, the
matrix $L_d$ is dense, and hence the updates to $w_d$ may change all the entries of $w_d$.
However, one can compute $L_d\phi(x)$ in $2dn$ steps instead of $n^2$ steps, exploiting the
simple form of $D_d^{-1}$. Initialize $a_i = \phi_i(x)$, then repeat step A d times followed
by step B d times to compute $L_d\phi(x)$.

step A: $a_i = a_i + a_{i+1}$, i = n - 1 to 1
step B: $a_i = a_i + a_{i-1}$, i = 2 to n
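Steps A and B are a reverse and a forward cumulative sum, respectively, so the $2dn$ procedure can be sketched with `numpy.cumsum` and compared against the dense product. The function name `apply_L` and the small test vector are assumptions for illustration; the dense matrices are built here only to cross-check the result.

```python
import numpy as np

def apply_L(phi, d):
    """Compute L_d phi = D_d^{-1} (D_d')^{-1} phi with 2*d cumulative-sum passes."""
    a = np.array(phi, dtype=float)
    for _ in range(d):                 # step A: a_i += a_{i+1}, i = n-1 down to 1
        a = np.cumsum(a[::-1])[::-1]
    for _ in range(d):                 # step B: a_i += a_{i-1}, i = 2 up to n
        a = np.cumsum(a)
    return a

# Cross-check against the dense n x n matrix on a small example.
n, d = 6, 2
D1 = np.eye(n) - np.eye(n, k=-1)
Dd = np.linalg.matrix_power(D1, d)
Dd_inv = np.linalg.inv(Dd)
L_dense = Dd_inv @ Dd_inv.T
phi = np.array([0.0, 0.0, 0.6, 0.4, 0.0, 0.0])
fast = apply_L(phi, d)
dense = L_dense @ phi
```

Step A applies the upper triangular factor and step B the lower triangular one, so neither $L_d$ nor the inverse difference matrices ever need to be stored during training.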

6 Experiments 

Often on large datasets consisting of very high dimensional features, one may compute the
encodings in the inner loop of the training algorithm to avoid the memory bottleneck. We
refer to this as the "online" method. Our solver is based on LIBLINEAR, but can be easily
used with any other solver. The custom solver allows us to exploit the sparsity of the
embeddings (Section 5). A practical regularization is $D_0 = I$ with the B-Spline embeddings,
where $I$ is the identity matrix, which leads to sparse features. This makes it difficult to
estimate the weights on the basis functions which have no data, but one can use a higher
order B-Spline basis to somewhat mitigate this problem.

We present image classification experiments on two image datasets, MNIST [10] and Daimler
Chrysler (DC) pedestrians [13]. On these datasets SVM classifiers based on the histogram
intersection kernel outperform a linear SVM classifier [11, 12] when used with features based
on a spatial pyramid of histograms of oriented gradients [2][9]. We obtain the features from
the authors' website for our experiments. The MNIST dataset has 60,000 instances and the
features are 2172 dimensional and dense, leading to 130,320,000 non-zero entries. The DC
dataset has three training sets and two test sets. Each training set has 19,800 instances and
the features are 656 dimensional and dense, leading to 12,988,800 entries. These sizes are
typical of image datasets, and training kernel SVM classifiers often takes several hours on a
single machine.

Figure 3: (a) 2D data generated according to $x^2 + y^2 \leq 1$. (b) B-Spline fits along x
for various degrees and regularizations and 4 uniformly spaced bins in [-1, 1]. (c) Fourier
fits along x using sin, cos features for various frequencies and regularizations. (d) Hermite
fits along x for various degrees and regularizations.







                      Regularization
Bins   Degree   D_0                D_1                D_2
5      1        06.60s (89.55%)    20.27s (89.68%)    041.60s (89.93%)
5      2        08.74s (90.45%)    30.47s (90.20%)    080.25s (89.94%)
5      3        11.68s (90.03%)    49.85s (89.93%)    143.50s (88.57%)
10     1        05.61s (90.42%)    23.06s (90.86%)    077.99s (89.43%)
10     2        08.10s (90.69%)    29.97s (90.73%)    126.03s (89.23%)
10     3        11.59s (90.48%)    42.26s (90.67%)    193.47s (89.14%)
20     1        05.96s (90.23%)    32.43s (91.20%)    246.87s (89.06%)
20     2        07.26s (90.34%)    34.99s (91.10%)    328.32s (88.89%)
20     3        10.08s (90.39%)    42.88s (91.00%)    429.57s (88.92%)

Table 2: B-Spline parameter choices on DC dataset. Training times and accuracies for various
B-Spline parameters on the first split of the DC dataset. In comparison, training a linear
model using LIBLINEAR takes 3.8s and achieves 81.5% accuracy. With D_0 regularization, higher
degree splines are more accurate, but with D_1 regularization a linear spline basis is as
accurate as quadratic or cubic when the number of bins is large. Training using D_0
regularization is about 4x faster than D_1. Higher order regularization seems unnecessary.
Note that this implementation computes the encodings on the fly and hence has the same memory
overhead as LIBLINEAR.



Toy Dataset. The points are sampled uniformly on a 2D grid [-1, 1] x [-1, 1], with the points
satisfying $x^2 + y^2 \leq 1$ in the positive class and the others in the negative class.
Figure 3(b) shows the fits on the data along the x (or y) dimension using 4 uniformly spaced
B-spline basis functions of various degrees and regularizations. The quadratic and cubic
splines offer smoother fits of the data. Figure 3(c, d) shows the learned functions using
Fourier and Hermite embeddings of various degrees, respectively.



Effect of B-Spline parameter choices. Table 2 shows the accuracy and training times as a
function of the number of bins, the regularization degree ($D_r$, r = 0, 1, 2) and the
B-Spline basis degree (d = 1, 2, 3) on the first split of the DC pedestrian dataset. We set
C = 1 and the bias term B = 1 for training all the models. On this dataset we find that
r = 0, 1 is more accurate than r = 2 and is significantly faster. In further experiments we
only include the results for r = 0, 1 and d = 1, 3. In addition, setting the regularization
degree to zero (r = 0) leads to very sparse features, which can be used directly with any
linear solver that can exploit this sparsity. The training time for B-Splines scales
sub-linearly with the number of bins, hence better fits of the functions can be obtained
without much loss in efficiency.



Effect of Fourier embedding parameter choices. Table 3 shows the accuracy and training times
for various Fourier embeddings on the DC dataset. Before computing the generalized Fourier
features, we first normalize the data in each dimension to [-1, 1] using:

$$x \leftarrow \frac{x - \mu}{\delta}, \quad \text{where } \mu = \frac{x_{\max} + x_{\min}}{2}, \ \delta = \frac{x_{\max} - x_{\min}}{2} \tag{16}$$
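The per-dimension normalization can be sketched as below. The function name is hypothetical, and the guard for constant columns is an added assumption to avoid division by zero; the text does not say how such dimensions are handled.

```python
import numpy as np

def normalize_columns(X):
    """Map each column of X to [-1, 1] via x <- (x - mu) / delta."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    mu = (x_max + x_min) / 2.0
    delta = (x_max - x_min) / 2.0
    delta = np.where(delta == 0.0, 1.0, delta)   # guard for constant columns (assumption)
    return (X - mu) / delta

X = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
Xn = normalize_columns(X)
```

In practice the minimum and maximum would be estimated on the training set and reused for the test set, so test values may fall slightly outside [-1, 1].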
          Fourier                                  Hermite
          r = 1              r = 2                 r = 1              r = 2
Degree    Accuracy   Time    Accuracy   Time       Accuracy   Time    Accuracy   Time
1         88.94%     07.0s   88.94%     07.0s      84.17%     02.8s   84.17%     02.8s
2         89.59%     10.2s   89.64%     10.2s      88.01%     04.6s   88.01%     04.6s
3         88.99%     12.7s   89.77%     12.8s      88.22%     07.7s   88.70%     09.9s
4         89.77%     16.0s   89.84%     15.9s      89.00%     12.6s   89.05%     11.9s

Table 3: Fourier embeddings on DC dataset. Accuracies and training times (encoding + training)
for various Fourier embeddings on the first split of the DC pedestrian dataset. Note that the
methods are trained by first encoding the features and then training a linear model using
LIBLINEAR.



Method                            Test Accuracy   Training Time
SVM (linear) + LIBLINEAR          81.49 (1.29)    3.8s
SVM (min) + LIBSVM                89.05 (1.42)    363.1s

                                                  online    batch
B-Spline (r = 0, d = 1, n = 05)   88.51 (1.35)    5.9s      -
B-Spline (r = 0, d = 3, n = 05)   89.00 (1.44)    10.8s     -
B-Spline (r = 1, d = 1, n = 10)   89.56 (1.35)    17.2s     -
B-Spline (r = 1, d = 3, n = 10)   89.25 (1.39)    19.2s     -
Fourier (r = 1, d = 2)            88.44 (1.43)    159.9s    12.7s (4x memory)
Hermite (r = 1, d = 4)            87.67 (1.26)    35.5s     12.6s (4x memory)

Table 4: Training times and test accuracies of various additive classifiers on DC dataset.
The online training method computes the encodings on the fly, and the batch method computes
the encoding once and uses LIBLINEAR to train a linear classifier. For B-Splines the online
method is faster, hence we omit the training times for the batch method. All the additive
classifiers outperform the linear SVM, while being up to 50x faster than the min SVM.

We precompute the features and use LIBLINEAR to train various models, since it is relatively more 
expensive to compute the features online. In this case the training times are similar to that of B- 
Spline models. However, precomputing and storing the features may not be possible on very large 
scale datasets. 

Comparison of various additive models. Table 4 shows the accuracy and training times of
various additive models compared to the linear SVM and the more expensive min kernel SVM on
all 6 training and test set combinations of the DC dataset. The optimal parameters were found
on the first training and test set. The additive models are up to 50x faster to train and are
as accurate as the min kernel SVM. The B-Spline additive models significantly outperform a
linear SVM on this dataset at the expense of a small additional training time.

Table 5 shows the accuracies and training times of various additive models on the MNIST
dataset. We train one-vs-all classifiers for each digit, and the classification scores are
normalized by passing them through a logistic. During testing, each example is given the
label of the classifier with the highest response. The optimal parameters for training were
found using 2-fold cross validation on the training set. Once again the additive models
significantly outperform the linear classifier and closely match the accuracy of the min
kernel SVM, while being 50x faster.
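The one-vs-all prediction rule can be sketched as follows. The scores are made-up values, the function name is hypothetical, and a plain logistic without fitted parameters is assumed here; the text does not specify whether the logistic is calibrated per classifier, in which case the ranking of classes could differ from that of the raw scores.

```python
import numpy as np

def predict_one_vs_all(scores):
    """Pick, per example, the class whose logistic-normalized score is highest."""
    probs = 1.0 / (1.0 + np.exp(-scores))   # plain logistic with no fitted parameters
    return probs.argmax(axis=1)

# Hypothetical raw scores for 3 examples and 4 classes.
scores = np.array([[ 0.2, -1.0,  1.5, -0.3],
                   [-2.0,  0.1, -0.5, -1.2],
                   [ 0.9,  0.8, -0.1,  1.1]])
labels = predict_one_vs_all(scores)
```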



Method                            Test Error   Training Time
SVM (linear) + LIBLINEAR          1.44%        6.2s
SVM (min) + LIBSVM                0.79%        ~2.5 hours
B-Spline (r = 0, d = 1, n = 20)   0.88%        31.6s
B-Spline (r = 0, d = 3, n = 20)   0.86%        51.6s
B-Spline (r = 1, d = 1, n = 40)   0.81%        157.7s
B-Spline (r = 1, d = 3, n = 40)   0.82%        244.9s
Hermite (r = 1, d = 4)            1.06%        358.6s

Table 5: Test error and mean training times per digit for various additive classifiers on
MNIST. For B-Splines the online method is faster, hence we omit the training times for the
batch method. All the additive classifiers outperform the linear SVM, while being up to 50x
faster than the min SVM.

7 Conclusion



We have proposed a family of embeddings which enable efficient learning of additive
classifiers. We advocate the use of B-Spline based embeddings because they are efficient to
compute and are sparse, which enables us to train these models with a small memory overhead
by computing the embeddings on the fly, even when the number of basis functions is large.
These embeddings can be seen as a generalization of [11]. Generalized Fourier features are
low dimensional but expensive to compute, and so are more suitable if the projected features
can be precomputed and stored. The proposed classifiers outperform linear classifiers and
match the significantly more expensive kernel SVM classifiers at a fraction of the training
time. On both the MNIST and DC datasets, a linear B-Spline basis with D_1 regularization
works best and closely approximates the learning problem of the min kernel SVM. Higher degree
splines are useful when used with D_0 regularization, which has even faster training times
but worse accuracies than D_1 regularization. The code for training the various spline models
proposed in the paper has been packaged as a library, LIBSPLINE, which will be released upon
the publication of this paper.